Derivative: Instantaneous rate of change
Note: We most often care about when derivatives are 0. Why?
Integrals: Area under the curve \(\int_{-\infty}^{\infty} f(x)\,dx\)
Note: If a curve is a probability distribution (proper) what is the AUC?
\[f = x^2 + xy + y^2\]
How does \(f(x,y)\) change when \(x\) changes? when \(y\) changes?
\[\frac{\partial f}{\partial x} = 2x + y\]
\[\frac{\partial f}{\partial y} = x + 2y\]
Shove those in a vector, and you’ve got a gradient.
\[ \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x + y \\ x + 2y \end{bmatrix} \]
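A quick numeric sanity check on those partials (a sketch using NumPy central differences; the point \((1, -2)\) is arbitrary):

```python
import numpy as np

def f(x, y):
    return x**2 + x*y + y**2

def grad_f(x, y):
    # the analytic gradient from above
    return np.array([2*x + y, x + 2*y])

# central-difference approximation of each partial
x0, y0, h = 1.0, -2.0, 1e-6
numeric = np.array([
    (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h),
    (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h),
])
print(grad_f(x0, y0), numeric)
```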
Note: what do gradients tell us about a function?
Covariance: Do two variables move together? Correlation is the standardized version: \[Corr(x,y) = \frac{Cov(x,y)}{sd(x)\,sd(y)}\]
gradients: first derivatives of a multivariable function \(f(x,y)\)
hessians: second derivatives of a multivariable function \(f(x,y)\)
\[\begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \\ \end{bmatrix}\]
Note: What does a second derivative tell you in a single variable function \(f(x)\)?
🚗
Position: \(r(t)\)
Velocity: \(\frac{dr}{dt}\)
Acceleration: \(\frac{d^2r}{dt^2}\)
\[f = x^2 + xy + y^2\]
\[\frac{\partial f}{\partial x} = 2x + y; \frac{\partial f}{\partial y} = x + 2y\] \[\begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \\ \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ \end{bmatrix}\]
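Because \(f\) is quadratic, its Hessian is the same constant matrix at every point, which a finite-difference sketch confirms (the evaluation point is arbitrary):

```python
import numpy as np

def f(x, y):
    return x**2 + x*y + y**2

def hessian(f, x, y, h=1e-4):
    # finite-difference second partials at (x, y)
    fxx = (f(x + h, y) - 2*f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2*f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return np.array([[fxx, fxy], [fxy, fyy]])

# for this quadratic, every point gives [[2, 1], [1, 2]]
print(hessian(f, 3.0, -1.5))
```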
Where is \(f''(x) = 0\)? positive? negative?
a design matrix in a linear model is a matrix of all the explanatory variables, \(X\).
\[y = X\beta\]
simple case: columns of \(X\) are each vectors of predictor variables (e.g. height, age…)
complicated case: one-hot encoding, intercept, polynomial terms
\[\underset{n\times (p+1)}{X} = \begin{bmatrix} 1 & 24 & 170 \\ 1 & 21 & 162 \\ ... & ... & ... \\ 1 & 22 & 150 \\ \end{bmatrix}; \underset{(p+1)\times 1}{\beta} = \begin{bmatrix}3.2 \\ 0.1 \\ 0.02 \end{bmatrix}\]
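With the slide's illustrative numbers (elided rows dropped), \(y = X\beta\) is one matrix multiply instead of one equation per row (a sketch):

```python
import numpy as np

# intercept column, age, height -- the slide's example values
X = np.array([[1, 24, 170],
              [1, 21, 162],
              [1, 22, 150]])
beta = np.array([3.2, 0.1, 0.02])

# one matrix multiply produces every fitted value at once
y_hat = X @ beta
print(y_hat)
```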
Using a design matrix \(X\) makes writing out the math easier. \(y\) is (usually) an \(n\times 1\) vector of responses, \(X\) is an \(n \times (p+1)\) matrix of predictors…plus a little extra.
\[ y = X \beta \] It’s just matrix shorthand for the traditional linear regression formula.
\[ y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n \]
\[ y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n \]
🔮 notice in the formula: does changing any of the \(x_n\)s impact the other \(x\)s?
\[ y = X \beta \]
The design matrix also lets us write the problem in a form that makes it clear that least squares can choose \(\beta\) to minimize \(\parallel y - X\beta \parallel^2\)!
\[ y = X \beta \]
Often, there is no \(\beta\) that can perfectly predict \(y\) (and if there is…maybe it’s overfitting…)
We want to choose the vector in the column space of \(X\) that is as close to \(y\) as possible. In other words: minimize \(\parallel y- X\beta \parallel^2\).
The best \(X\beta\) we can choose, one that is both in \(C(X)\) and as close to \(y\) as possible, is the projection (“shadow”) of \(y\) on \(C(X)\).
So our best fit is the projection of \(y\) onto the column space \(C(X)\):
\[ X \beta = proj_{C(X)} y \]
A little mathy-math
\[ X\beta - y = proj_{C(X)} y - y \]
\(proj_{C(X)} y - y\) is orthogonal to \(C(X)\), and thus is in the nullspace of \(X^T\).
what does that mean?
\[ X^T(X\beta - y) = 0 \]
\[ X\beta - y \in C(X)^\perp \rightarrow \\ C(X)^\perp = N(X^T) \rightarrow \\ X^T(X\beta - y) = 0 \]
doing a little LA: \[ X^TX\beta - X^Ty = 0 \\ X^TX\beta = X^Ty \\ (X^TX)^{-1}X^TX\beta = (X^TX)^{-1}X^Ty \\ \beta = (X^TX)^{-1}X^Ty \]
(for example, if there is perfect collinearity: total sales, US sales, non-US sales, \(X^TX\) is not invertible, \(X\) is not of full rank, and there exists multiple solutions!)
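A sketch on synthetic data, checking the normal-equations solution against NumPy's least-squares routine (the true coefficients just echo the \(\beta\) from the earlier slide):

```python
import numpy as np

rng = np.random.default_rng(0)
# intercept column plus two well-behaved (non-collinear) predictors
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([3.2, 0.1, 0.02]) + rng.normal(scale=0.1, size=50)

# normal equations: solve (X^T X) beta = X^T y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# library least-squares for comparison
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ne)
```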
the determinant of \(A\) is the area of the unit square after multiplying by \(A\)
\[ \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]
the determinant of \(A\) is the area of the unit square after multiplying by \(A\)
\[ \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix} \]
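Checking both numerically: the first matrix squashes the unit square to area 0.75, while the second (whose columns are linearly dependent) flattens it to a line with area 0:

```python
import numpy as np

A = np.array([[1, 0.5],
              [0.5, 1]])
B = np.array([[1, 2],
              [0.5, 1]])

# area of the unit square after applying each matrix
print(np.linalg.det(A), np.linalg.det(B))
```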
trace: sum of the diagonal elements
\[ \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix} \]
\(tr \left ( \underset{n \times n}{A} \right ) = \sum_{i=1}^n \lambda_i\)
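Checking trace-equals-sum-of-eigenvalues on the matrix above (its eigenvalues are 2 and 0):

```python
import numpy as np

A = np.array([[1, 2],
              [0.5, 1]])

# trace = sum of the diagonal = sum of the eigenvalues
print(np.trace(A), np.linalg.eigvals(A).sum())
```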
We can use SVD to do PCA by decomposing the data matrix \(X\) directly, rather than eigendecomposing the covariance matrix \(C = \frac{X^TX}{n-1}\) into \(VLV^T\), where \(V\) holds the eigenvectors (PCs) and \(L\) the eigenvalues.
if \(X = UDV^T\), then \(V\) contains the eigenvectors needed to create the PCs, and the singular values in \(D\) are the square roots of the (scaled) eigenvalues: \(\lambda_i = d_i^2/(n-1)\).
\[ C = \frac{VDU^TUDV^T}{n-1} = \\ V\frac{D^2}{n-1}V^T \]
source: Wikipedia
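Both routes can be compared on synthetic data: the squared singular values of centered \(X\), divided by \(n-1\), match the eigenvalues of \(C\) (a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)              # PCA assumes centered columns
n = X.shape[0]

# route 1: eigendecompose the covariance matrix
C = X.T @ X / (n - 1)
evals, V_eig = np.linalg.eigh(C)

# route 2: SVD of the data matrix itself
U, D, Vt = np.linalg.svd(X, full_matrices=False)
evals_svd = D**2 / (n - 1)

# same eigenvalues (eigh sorts ascending, SVD sorts descending)
print(np.sort(evals)[::-1], evals_svd)
```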
\(Ax\)
\[ B = \begin{bmatrix} 1 & 2 \\ 0.5 & 1 \end{bmatrix}, A = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]
\[ Ax = \lambda x \]
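A quick check of \(Ax = \lambda x\) for the symmetric matrix \(A\) above (its eigenvalues are 0.5 and 1.5):

```python
import numpy as np

A = np.array([[1, 0.5],
              [0.5, 1]])
lams, vecs = np.linalg.eigh(A)   # A is symmetric, so eigh applies

# each column v of vecs satisfies A v = lambda v
for lam, v in zip(lams, vecs.T):
    print(lam, np.allclose(A @ v, lam * v))
```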
\[ \begin{bmatrix} 1 & 3\\ 1 & -1 \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2}\\ \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{bmatrix} \begin{bmatrix} \sqrt2 & \sqrt2\\ 0 & 2\sqrt2 \end{bmatrix} \]
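That factorization can be verified by multiplying back (and checking the first factor \(Q\) is orthogonal):

```python
import numpy as np

s = 1 / np.sqrt(2)
Q = np.array([[s, s],
              [s, -s]])
R = np.array([[np.sqrt(2), np.sqrt(2)],
              [0, 2 * np.sqrt(2)]])

# Q is orthogonal, R is upper triangular, and QR recovers the original
print(Q @ R)
```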
Solve \(X\beta = y\) in the least-squares sense: find \(\hat\beta = \arg \underset{\beta}{\min}\left\| X\beta - y \right\|_2\)
Could do
\[ \left( X\beta - y\right)^T\left( X\beta - y\right) \\ \beta^TX^TX\beta - \beta^TX^Ty - y^TX\beta + y^Ty \\ \beta^TX^TX\beta- 2y^TX\beta + y^T y \\ \nabla_\beta [\beta^TX^TX\beta- 2 y^TX\beta + y^Ty] = 2X^TX\beta - 2X^T y \\ 2X^TX\beta - 2X^T y = 0 \\ X^TX\beta = X^Ty \\ \beta = \underbrace{\left(X^TX\right)^{-1}X^T}_\text{Moore-Penrose Inverse} y \]
But finding inverses SUCKS.
Substitute \(A = QR\).
\[ A^TAx = A^Tb \to \left(QR \right)^T\left(QR \right)x = \left(QR \right)^Tb \\ R^TQ^TQRx = R^TQ^Tb \\ R^TRx = R^TQ^Tb \\ Rx = Q^Tb \\ Rx = v \]
Since \(R\) is upper triangular, it’s easy to solve with backsubstitution. No inverses!!
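A sketch on synthetic data: solve least squares via QR (no explicit inverse) and compare to `lstsq`. NumPy's general `solve` is used on the triangular \(R\) here for simplicity; a dedicated triangular solver would exploit the back-substitution structure directly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

Q, R = np.linalg.qr(X)                 # reduced QR: Q (30x3), R (3x3)
beta_qr = np.linalg.solve(R, Q.T @ y)  # solve R beta = Q^T y

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_qr)
```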
\[ x_1 + 2x_2 -x_3 = 3 \\ -2x_2 - 2x_3 = -10 \\ -6x_3 = -18 \] \[ A = \begin{bmatrix} 1 & 2 & -1 \\ 0 & -2 & -2 \\ 0 & 0 & -6 \end{bmatrix}, A\begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix}3 \\ -10 \\ -18 \end{bmatrix} \]
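The bottom-up process can be written out directly; a minimal sketch applied to the system above:

```python
import numpy as np

def back_substitute(U, b):
    """Solve U x = b for upper-triangular U, bottom row first."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # subtract the already-solved terms, divide by the diagonal
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

A = np.array([[1, 2, -1],
              [0, -2, -2],
              [0, 0, -6]], dtype=float)
b = np.array([3, -10, -18], dtype=float)
print(back_substitute(A, b))  # [2. 2. 3.]
```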
For \(A\), a symmetric positive-definite matrix, \(A = LL^T\), where \(L\) is lower triangular.
Basically the Square Root of a Matrix (and a special case of \(LU\) decomp)
Cholesky Decomp can force vectors of random numbers to have a specified covariance!
Proof
Remember, \(\Sigma = LL^T\) and \(Z = LX\)
\[ \mathbb{E}\left( ZZ^T\right) = \mathbb{E}\left( (LX)(LX)^T\right) = \\ \mathbb{E}\left( LXX^TL^T\right) = \\ L\mathbb{E}\left(XX^T\right)L^T = \\ LIL^T = LL^T = \Sigma \]
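A sketch of that in action: draw uncorrelated standard normals, multiply by \(L\), and check that the sample covariance lands near \(\Sigma\) (the 0.6 correlation is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])        # target covariance
L = np.linalg.cholesky(Sigma)         # Sigma = L L^T

X = rng.normal(size=(2, 100_000))     # uncorrelated standard normals
Z = L @ X                             # Z now has covariance ~ Sigma
print(np.cov(Z))
```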
How do we find this approximation?
\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n \]
How do we find this approximation? (we’re matching derivatives!*)
\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} (x)^n = \frac{f(0)}{0!} + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + ... \]
Use?
\[ \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} (x)^n = \frac{f(0)}{0!} + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 \]
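A sketch of how well the second-order version does for \(e^x\), where \(f(0) = f'(0) = f''(0) = 1\), compared against the true value at a few points (the approximation is best near 0):

```python
import math

def taylor2(f0, f1, f2, x):
    # second-order Maclaurin: f(0) + f'(0) x + f''(0) x^2 / 2
    return f0 + f1 * x + f2 * x**2 / 2

# e^x: every derivative at 0 equals 1
for x in (0.1, 0.5, 1.0):
    print(x, math.exp(x), taylor2(1, 1, 1, x))
```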